# Steps planning
# 1. Load and process knowledge graph data
# 1.1 Data cleaning: handle missing values, transform data, select desired columns
# 2. Focus on fishing-related companies
# Since we are going to focus on fishing business anomalies, we will need to filter out the subset of data related to fishing activities. We will have to do a bit of text sensing.
## 2.1 Word tokenization for business key-activity extraction, then filter for fishing-related records only
## 2.2 Transform the business relationships by counting company-to-company, company-to-owner, company-to-contact-information links etc.; categorize the relationships manually and give them labels
# Then, among the business groups, we can do exploratory analysis on the relationships and explore those with the top number of connections
# 3. Visualizing patterns of groups
# 3.1 Apply clustering algorithms to identify groups of related entities within the knowledge graph. Use techniques such as community detection algorithms (e.g., the Louvain algorithm) to detect clusters.
# 3.2 Visualize the identified groups using appropriate visualizations, such as a network graph or a dendrogram, to understand the patterns and relationships among the groups.
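The community detection step in 3.1 can be illustrated on a toy graph. This is only a minimal sketch using igraph's `cluster_louvain()`; the actual run would use the knowledge graph built from MC3 later in the document:

```r
library(igraph)

# Toy undirected graph: two triangles with no edge between them,
# so Louvain should recover exactly two communities.
edges <- data.frame(
  from = c("A", "A", "B", "D", "D", "E"),
  to   = c("B", "C", "C", "E", "F", "F")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Louvain community detection (modularity optimisation)
communities <- cluster_louvain(g)
membership(communities)
```

The same `membership()` vector can later be joined back onto the node table for colouring a network plot.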
# 4. Identifying anomalies in business groups
# 4.1 Define anomaly detection metrics based on known characteristics of IUU fishing companies. Consider factors such as the number of connections, financial transactions, or unusual ownership structures.
# 4.2 Calculate anomaly scores for each business group based on the defined metrics. Use appropriate statistical techniques or anomaly detection algorithms such as Isolation Forest or Local Outlier Factor. R packages like anomalyDetection and outliers can be used for this purpose.
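As a dependency-free sketch of the scoring idea in step 4 (the real run could use Isolation Forest or Local Outlier Factor as noted above), an anomaly score can be as simple as the largest absolute z-score across features. The feature names below are made up for illustration:

```r
# Hypothetical per-group features; in the real analysis these would come
# from the knowledge graph (e.g. connection counts, revenue_omu).
set.seed(123)
features <- data.frame(
  n_connections = c(rnorm(20, mean = 5, sd = 1), 40),   # row 21 is an injected outlier
  revenue       = c(rnorm(20, mean = 100, sd = 10), 900)
)

# Anomaly score = max absolute z-score over the standardized features
z <- scale(features)
anomaly_score <- apply(abs(z), 1, max)
which.max(anomaly_score)  # row 21, the injected outlier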
# 5. Measuring similarity and expressing confidence in groupings
# 5.1 Calculate similarity scores or distances between businesses within each group using appropriate similarity measures, such as cosine similarity or Euclidean distance.
# 5.2 Visualize the similarity measures to express confidence in the groupings, e.g. with heatmaps or dendrograms.
Take Home Exercise 3
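A minimal sketch of the cosine-similarity measure from step 5, on a made-up feature matrix; the heatmap step would then run on the resulting matrix (e.g. with `stats::heatmap(sim)`):

```r
# Cosine similarity between the row vectors of a numeric feature matrix
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

# Three toy businesses described by three numeric features
biz <- rbind(a = c(1, 0, 2),
             b = c(2, 0, 4),   # same direction as a -> similarity 1
             c = c(0, 3, 0))   # orthogonal to a -> similarity 0
sim <- cosine_sim(biz)
round(sim, 3)
```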
Objective definition:
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure.
The research below aims to help FishEye develop a new visual analytics approach to better understand fishing business anomalies.
We will use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.
Task 1: Use visual analytics to identify anomalies in the business groups present in the knowledge graph.
Task 2: Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user.
1. Data Pre-processing and cleaning
Load the libraries and read the MC3 JSON relationship file.
#| echo: false
#tidytext -- text mining library with R: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
#Load Libraries
pacman::p_load(jsonlite,tidygraph, ggraph, visNetwork, tidyverse, shiny, plotly, graphlayouts, ggforce, tidytext,skimr)
#load Data
MC3 <- fromJSON("data/MC3.json")
Data Cleaning for MC3 Nodes and Edges
We picked the desired fields and reorganized the columns using the select function. The nodes in MC3 are companies or persons, with a description of each company, its products and services, country and revenue generated.
As we loaded the data, we found this graph is not directed, so we will not know the in/out direction of a connection.
The code below extracts the nodes and edges for further processing.
#glimpse(MC3)
MC3_nodes <- as_tibble(MC3$nodes)
colSums(is.na(MC3_nodes))
         country               id product_services      revenue_omu             type 
               0                0                0                0                0 
#Extract and mutate the columns so they are atomic vectors, not lists
MC3_nodes_clean <- MC3_nodes %>% mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)), #we need to convert to numeric directly
type = as.character(type)) %>%
select(id, country, type, revenue_omu, product_services)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `revenue_omu = as.numeric(as.character(revenue_omu))`.
Caused by warning:
! NAs introduced by coercion
The original data had no NA values; however, after transforming the data into tabular format and coercing revenue_omu to numeric, some fields became NA.
#check data quality, find missing value
colSums(is.na(MC3_nodes_clean))
              id          country             type      revenue_omu product_services 
               0                0                0            21515                0 
#check which are the types?
# unique(MC3_nodes_clean$type)
skim(MC3_nodes_clean)
| Name | MC3_nodes_clean |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
Out of the total nodes, 21515 of the 27622 rows have no value for revenue_omu; the ratio of missing values in revenue_omu is 77.9%, and we will need to deal with these missing values. In addition, only 22929 of the 27622 rows have unique ids, so there are duplicated ids; the ratio of non-duplicate ids is 83.0%.
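The two ratios quoted above come from straightforward checks. A self-contained toy version of the same computations (the real calls would run on MC3_nodes_clean):

```r
# Toy frame standing in for MC3_nodes_clean
toy <- data.frame(id          = c("a", "a", "b", "c"),
                  revenue_omu = c(10, NA, NA, NA))

missing_ratio <- mean(is.na(toy$revenue_omu))        # share of missing revenue_omu
unique_ratio  <- length(unique(toy$id)) / nrow(toy)  # share of non-duplicate ids
c(missing_ratio, unique_ratio)
```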
#check which are the duplicate ids
duplicate_ids <- MC3_nodes_clean[duplicated(MC3_nodes_clean$id), "id"]
# Create a data frame to store the duplicate IDs and corresponding rows
duplicate_records <- data.frame(id = character(),
country = character(),
type = character(),
revenue_omu = character(),
product_services = character(),
stringsAsFactors = FALSE)
# Filter to remove duplicates:
# - if two rows share an id but differ in any of the other 4 columns (country, type, revenue_omu, product_services), keep both rows
# - if two rows share an id and have the same values in all other 4 columns, keep only one of them
# base R duplicated() on the whole data frame achieves this
MC3_nodes_clean_noDup <- MC3_nodes_clean[!duplicated(MC3_nodes_clean), ]
duplicate_ids_clean <- MC3_nodes_clean_noDup[duplicated(MC3_nodes_clean_noDup$id), "id"]
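A small self-contained check of the rule implemented above: `duplicated()` on the whole data frame drops only rows that match in every column, while rows sharing an id but differing elsewhere survive:

```r
df <- data.frame(id      = c("x", "x", "x"),
                 country = c("SG", "SG", "MY"))

deduped <- df[!duplicated(df), ]
deduped  # the exact-duplicate second row is gone; the "MY" row stays
```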
DT::datatable(MC3_nodes_clean)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
MC3_edges <- as_tibble(MC3$links) %>%
distinct() %>%
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target) %>%
ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
#check missing value
colSums(is.na(MC3_edges))
 source  target    type weights 
      0       0       0       0 
There are no missing values in the edges data. We then explore the dataset.
DT::datatable(MC3_edges)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
#check which are the types?
unique(MC3_edges$type)
[1] "Company Contacts" "Beneficial Owner"
#datatable() of the DT package is used to display the MC3_edges tibble data frame as an interactive table on the html document.
In order to find the business groups, we will check the different categories of relationship in the data. There might be owner-business, customer-business and business-business relationships.
ggplot(data = MC3_edges,
aes(x = type)) +
geom_bar()
MC3_edges_clean <- MC3_edges %>% mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target) %>%
ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
Handling missing values: some of the product_services values are blank placeholders such as “character(0)”; we recode these values to NA before passing them on for text sensing:
# Recode "character(0)" to NA in the product_services column
MC3_nodes_clean$product_services[MC3_nodes_clean$product_services == "character(0)"] <- NA
ggplot(data = MC3_nodes_clean,
aes(x = type)) +
geom_bar()
Building network model with tidygraph
id1 <- MC3_edges_clean %>%
select(source) %>%
rename(id = source)
id2 <- MC3_edges_clean %>%
select(target) %>%
rename(id = target)
MC3_nodes1 <- rbind(id1, id2) %>%
distinct() %>%
left_join(MC3_nodes_clean,
unmatched = "drop")
Joining with `by = join_by(id)`
mc3_graph <- tbl_graph(nodes = MC3_nodes1,
edges = MC3_edges_clean,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness())
mc3_graph %>%
filter(betweenness_centrality >= 100000) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha=0.5)) +
geom_node_point(aes(
size = betweenness_centrality,
alpha = 0.5),
color = "lightblue") +
scale_size_continuous(range=c(1,10))+
theme_graph()
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

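As a sanity check on the centrality computation above, here is a toy tbl_graph where the hub of a star should get the highest betweenness centrality (illustration only, not the MC3 graph):

```r
library(tidygraph)

# Star graph: node 1 sits on every shortest path between the leaves,
# so it should receive the largest betweenness value.
edges <- data.frame(from = c(1, 1, 1), to = c(2, 3, 4))
g <- tbl_graph(edges = edges, directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

as_tibble(g)$betweenness  # hub (node 1) has the largest value
```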
Text sensing with tidytext
Word count
#start a bit of text sensing, display the result by max value first
MC3_nodes_clean <- MC3_nodes_clean %>%
mutate(n_fish = str_count(product_services, "fish")) %>%
arrange(desc(n_fish))
library(ggplot2)
# MC3_nodes_clean <- MC3_nodes_clean %>%
# group_by(type) %>%
# summarize(n_fish = sum(str_count(product_services, "fish")), .groups = "drop")
ggplot(data = MC3_nodes_clean, aes(x = type, y = n_fish)) +
geom_bar(stat = "identity")
Warning: Removed 18959 rows containing missing values (`position_stack()`).

ggplot(data = MC3_nodes_clean, aes(x = type, y = revenue_omu)) +
geom_bar(stat = "identity")
Warning: Removed 21515 rows containing missing values (`position_stack()`).

Tokenisation
In text sensing, tokenisation is the process of breaking up a given text into units called tokens. We will discard characters like punctuation marks in this process.
The two basic arguments to unnest_tokens() used here are column names: first the output column name that will be created as the text is unnested into it, and then the input column that the text comes from (product_services, in this case).
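Before running it on the real data, a minimal illustration of unnest_tokens() on a made-up product_services string (by default it lowercases the text and strips punctuation):

```r
library(tidytext)

# Toy one-row frame standing in for MC3_nodes_clean
toy <- data.frame(id = 1, product_services = "Frozen fish, fish oil!")

tokens <- unnest_tokens(toy, word, product_services)
tokens$word  # "frozen" "fish" "fish" "oil"
```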
nodesToken <- MC3_nodes_clean %>%
unnest_tokens (word, product_services)
#can add in to_lower = TRUE
#can add in strip_punct = TRUE
nodesToken %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in product_services field")
Selecting by n

From the chart above, we saw that the most frequently occurring words may not be useful, for example NA, “a” and “to”. We will need to remove these words as stop words.
From the tokens generated, we take out the common/generic words and also exclude NA records.
tidy_stopwords <- nodesToken %>%
anti_join(stop_words)%>%
na.omit()
Joining with `by = join_by(word)`
Visualization with a bar chart after removing stop words
tidy_stopwords %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Unique words",
y = "Count",
title = "Count of unique words found in product_services field")
Selecting by n

#use parallel coordinates to visualize
# library(cluster)
# library(caret)
#
# MC3_nodes <- MC3_nodes %>%
# select(product_services, country, revenue_omu, type) %>%
# na.omit()
#
#
#
# #prepare data
# clustering_data <- MC3_nodes[, c("product_services", "revenue_omu", "type")]
# #try K means clustering
# k <- 4 # Number of clusters
# set.seed(123) # For reproducibility
# kmeans_result <- kmeans(clustering_data, centers = k)
#
# MC3_nodes$cluster <- as.factor(kmeans_result$cluster)
# cluster_summary <- aggregate(clustering_data, by = list(cluster = MC3_nodes$cluster), FUN = mean)
#
#
#
# pacman::p_load(GGally, parallelPlot)
# library(GGally)
# ggparcoord(MC3_nodes[, c("product_services", "country", "revenue_omu", "type","cluster")],
# columns = 1:3, groupColumn = "cluster",
# title = "Parallel Coordinate Plot: Features by Cluster")
# plotting relationships?
TODO - this failed and needs troubleshooting
# GraphMC3 <- tbl_graph(nodes = MC3_nodes_clean,
# edges = MC3_edges_clean,
# directed = FALSE)
# #
# GraphMC3
#is_connected <- is.connected(GraphMC3)
# peopleEntityRelationship %>%
# activate(edges) %>%
# arrange(desc(weightkg))